Data analysis is at the core of data science, enabling professionals to extract insights, identify patterns, and make evidence-based decisions. This course equips you with practical tools and techniques to analyze structured and unstructured data from various sources, preparing you for real-world applications such as predictive modeling, trend analysis, and text mining. Similarity measures in particular appear across many algorithms and fields, for example clustering, recommendation systems, k-nearest neighbors (KNN), and retrieval-augmented generation (RAG).
This course covers foundational and advanced topics in data analysis, from data collection and preprocessing to the application of analytical models. You will learn about web data extraction, file format handling, time series analysis, text mining techniques, and similarity measures.
Overview of the data analysis process, key concepts in data science, types of data, data cleaning basics, and exploratory data analysis (EDA) techniques using visual and numerical summaries.
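As a taste of what numerical EDA looks like, here is a minimal sketch using only the standard library; in practice pandas' `describe()` and plotting libraries are the usual tools. The sample values are made up for illustration.

```python
# Minimal numerical EDA summary with the standard library only;
# the price values below are hypothetical.
import statistics

prices = [12.5, 14.0, 13.2, 15.8, 14.9, 13.7, 99.0]

summary = {
    "count": len(prices),
    "mean": statistics.mean(prices),
    "median": statistics.median(prices),
    "stdev": statistics.stdev(prices),
    "min": min(prices),
    "max": max(prices),
}

# A large gap between mean and median hints at an outlier (here, 99.0),
# exactly the kind of issue EDA is meant to surface before modeling.
print(summary)
```

Comparing the mean against the median is a quick, purely numerical outlier check that complements visual tools like box plots and histograms.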
Techniques for extracting data from websites and online platforms, understanding APIs, using tools like BeautifulSoup or Scrapy, and addressing ethical considerations in web scraping.
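To illustrate the extraction step, here is a minimal sketch using only the standard library's `html.parser`; tools like BeautifulSoup or Scrapy wrap the same idea in far more convenient APIs. The HTML snippet is hard-coded; in a real scraper it would come from an HTTP response, fetched in accordance with the site's robots.txt and terms of service.

```python
# Extract all link targets (href attributes) from an HTML document
# using the standard library's event-driven parser.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; collect hrefs from <a> tags.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

html = '<html><body><a href="/page1">One</a> <a href="/page2">Two</a></body></html>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # the extracted href values
```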
Understanding the structure of XML and JSON files, parsing techniques in Python, transforming nested data, and integrating data from external web APIs and files.
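A minimal sketch of both parsing tasks with the standard library, including flattening one level of nesting; the field names and values are made up, and a real API response would be fetched over HTTP first.

```python
# Parse JSON and XML with the standard library and flatten nested fields.
import json
import xml.etree.ElementTree as ET

# JSON: nested object flattened into a single-level record.
json_text = '{"user": {"name": "Ada", "scores": [88, 92]}}'
record = json.loads(json_text)
flat = {
    "name": record["user"]["name"],
    "avg_score": sum(record["user"]["scores"]) / len(record["user"]["scores"]),
}

# XML: walk the element tree and pull out text content.
xml_text = "<users><user><name>Ada</name></user></users>"
root = ET.fromstring(xml_text)
names = [u.findtext("name") for u in root.findall("user")]

print(flat, names)
```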
Analyzing temporal data, identifying trends and seasonality, time series decomposition, forecasting methods like ARIMA, and applications in business and finance.
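To sketch the first step of classical decomposition, the example below separates trend from seasonality with a centered moving average; full decomposition and ARIMA forecasting would normally use a library such as statsmodels. The series is synthetic: a linear trend plus a period-5 seasonal pattern that sums to zero, so a window of 5 cancels it exactly.

```python
# Recover the trend of a synthetic series (linear trend + period-5
# seasonality) with a centered moving average whose window matches
# the seasonal period.
seasonal = [2, -1, -2, 0, 1]  # one full cycle, sums to zero
series = [t + seasonal[t % 5] for t in range(15)]

def centered_moving_average(values, window):
    """Average over an odd-sized window centered on each interior point."""
    half = window // 2
    return [
        sum(values[i - half : i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]

trend = centered_moving_average(series, 5)
print(trend)  # exactly recovers the linear trend 2.0, 3.0, ..., 12.0
```

Because the window spans one full seasonal cycle, the seasonal component averages out and only the trend remains; the residual after subtracting the trend would expose the seasonality.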
Using natural language processing to classify text sentiment, understanding lexicon-based vs. machine learning approaches, and applying sentiment models to product reviews, tweets, or news articles.
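The lexicon-based approach can be sketched in a few lines: score a text by summing word polarities from a sentiment dictionary. The lexicon below is a toy, hand-made example; real lexicons (such as VADER's) are far larger and also handle negation and intensifiers.

```python
# Toy lexicon-based sentiment scoring: sum per-word polarity values.
LEXICON = {"great": 2, "good": 1, "bad": -1, "terrible": -2}  # illustrative only

def sentiment_score(text):
    words = text.lower().split()
    return sum(LEXICON.get(w, 0) for w in words)  # unknown words score 0

review = "great phone but terrible battery"
print(sentiment_score(review))  # 2 + (-2) = 0, i.e. mixed sentiment
```

Machine-learning approaches instead learn these weights from labeled examples, which lets them capture context the fixed lexicon misses.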
Techniques for preprocessing textual data (tokenization, stemming, stopword removal), feature extraction methods like TF-IDF, and identifying patterns or topics within large text corpora.
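The preprocessing and weighting pipeline can be sketched in pure Python; in practice scikit-learn's `TfidfVectorizer` or NLTK would handle this, and real stopword lists are much longer than the toy set below.

```python
# Tokenization, stopword removal, and TF-IDF weighting on a toy corpus.
import math

STOPWORDS = {"the", "a", "is", "of"}  # tiny illustrative list

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

docs = ["the cat sat", "the dog sat", "the cat is a cat"]
tokenized = [tokenize(d) for d in docs]

def tf_idf(term, doc_tokens, corpus):
    tf = doc_tokens.count(term) / len(doc_tokens)        # term frequency
    df = sum(1 for d in corpus if term in d)             # document frequency
    idf = math.log(len(corpus) / df)                     # inverse doc frequency
    return tf * idf

# "cat" dominates the third document but misses the second,
# so it gets a high weight there.
print(tf_idf("cat", tokenized[2], tokenized))
```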
Review of linear regression, introduction to multiple regression, model evaluation techniques, assumptions checking, and using regression in predictive analytics.
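As a minimal sketch of the simple (one-predictor) case, the closed-form least-squares solution can be computed directly; multiple regression, assumption checks, and diagnostics would normally use statsmodels or scikit-learn. The data points are made up and lie exactly on the line y = 2x + 1, so the fit recovers it.

```python
# Simple linear regression via the closed-form least-squares estimates.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)    # 2.0 1.0
```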
Understanding dynamic programming as an algorithmic strategy, computing edit distance between strings, applications in spell checking, plagiarism detection, and bioinformatics.
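The classic dynamic-programming formulation of Levenshtein edit distance can be sketched as follows: `dp[i][j]` holds the minimum number of edits turning the first `i` characters of `a` into the first `j` characters of `b`.

```python
# Levenshtein edit distance via bottom-up dynamic programming.
def edit_distance(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))  # 3
```

In spell checking, candidate corrections are ranked by this distance; in bioinformatics the same recurrence underlies sequence alignment, with costs tuned to biological events.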
Measuring similarity between text documents or data vectors using cosine similarity, Jaccard index, Euclidean distance, and understanding their roles in clustering and recommendation systems.
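The three measures can be sketched on toy inputs as below; in practice `scipy.spatial.distance` or scikit-learn provide vectorized versions.

```python
# Cosine similarity and Euclidean distance on vectors; Jaccard index on sets.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def jaccard_index(s, t):
    return len(s & t) / len(s | t)  # shared items over all distinct items

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u, v = [1.0, 2.0, 0.0], [2.0, 4.0, 0.0]
print(cosine_similarity(u, v))                # 1.0: same direction, magnitude ignored
print(jaccard_index({"a", "b"}, {"b", "c"}))  # 1/3: one shared item out of three
print(euclidean_distance(u, v))               # sensitive to magnitude, unlike cosine
```

Note the contrast: `v` is just `u` scaled by 2, so cosine similarity is maximal while Euclidean distance is nonzero. This is why cosine is preferred for text vectors, where document length should not dominate.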
Representing documents in vector space, applying TF-IDF weighting, calculating similarity, and using the model in information retrieval and ranking tasks.
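Putting the pieces together, here is a minimal retrieval sketch: documents and the query become term-count vectors over a shared vocabulary, then documents are ranked by cosine similarity to the query. TF-IDF weighting (omitted here for brevity) would down-weight common terms before ranking; the corpus and query are toy examples.

```python
# Vector space model: rank documents against a query by cosine similarity.
import math
from collections import Counter

docs = ["cheap flights to rome", "cheap hotels in rome", "python data analysis"]
query = "cheap rome"

# Shared vocabulary over corpus and query.
vocab = sorted({w for d in docs + [query] for w in d.split()})

def to_vector(text):
    counts = Counter(text.split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q = to_vector(query)
ranked = sorted(docs, key=lambda d: cosine(to_vector(d), q), reverse=True)
print(ranked)  # both travel documents outrank the unrelated one
```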
Don't worry about knowing everything at once; focus on understanding the logic behind each method and how it is used in real-world problems. Small projects or side exercises can really help solidify the concepts.